Preparations

Load the necessary libraries

library(car) # for regression diagnostics
library(broom) # for tidy output
library(ggfortify) # for model diagnostics
library(sjPlot) # for outputs
library(knitr) # for kable
library(effects) # for partial effects plots
library(emmeans) # for estimating marginal means
library(ggeffects) # for partial effects plots
library(MASS) # for glm.nb
library(MuMIn) # for AICc
library(DHARMa) # for residual diagnostics plots
library(modelr) # for auxillary modelling functions
library(performance) # for residuals diagnostics
library(see) # for plotting residuals
library(patchwork) # for grids of plots
library(tidyverse) # for data wrangling

Scenario

Loyn (1987) modelled the abundance of forest birds with six predictor variables (patch area, distance to nearest patch, distance to nearest larger patch, grazing intensity, altitude and years since the patch had been isolated).

Regent honeyeater

Format of loyn.csv data file

ABUND DIST LDIST AREA GRAZE ALT YR.ISOL
.. .. .. .. .. .. ..
ABUND Abundance of forest birds in patch- response variable
DIST Distance to nearest patch - predictor variable
LDIST Distance to nearest larger patch - predictor variable
AREA Size of the patch - predictor variable
GRAZE Grazing intensity (1 to 5, representing light to heavy) - predictor variable
ALT Altitude - predictor variable
YR.ISOL Number of years since the patch was isolated - predictor variable

The aim of the analysis is to investigate the effects of a range of predictors on the abundance of forest birds.

Read in the data

loyn <- read_csv("../public/data/loyn.csv", trim_ws = TRUE)
glimpse(loyn)
## Rows: 56
## Columns: 7
## $ ABUND   <dbl> 5.3, 2.0, 1.5, 17.1, 13.8, 14.1, 3.8, 2.2, 3.3, 3.0, 27.6, 1.8…
## $ AREA    <dbl> 0.1, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.…
## $ YR.ISOL <dbl> 1968, 1920, 1900, 1966, 1918, 1965, 1955, 1920, 1965, 1900, 19…
## $ DIST    <dbl> 39, 234, 104, 66, 246, 234, 467, 284, 156, 311, 66, 93, 39, 40…
## $ LDIST   <dbl> 39, 234, 311, 66, 246, 285, 467, 1829, 156, 571, 332, 93, 39, …
## $ GRAZE   <dbl> 2, 5, 5, 3, 5, 3, 5, 5, 4, 5, 3, 5, 2, 1, 5, 5, 3, 3, 3, 2, 2,…
## $ ALT     <dbl> 160, 60, 140, 160, 140, 130, 90, 60, 130, 130, 210, 160, 210, …

Exploratory data analysis

This is an application of multiple regression. The response variable in this instance is a little awkward in that it appears to be an average of multiple point quadrats rather than a pure count. Central limits theorem suggests that averages should follow a normal distribution and thus we might expect that it is reasonable to model this bird abundance against a Gaussian distribution. However, this has the potential to become problematic since a Gaussian distribution can extend below zero whereas this is clearly not logical for bird abundance. As a result, we might find the at the model can predict bird abundances less than zero.

If this were our own analysis, we might first attempt to model these data against a Gaussian distribution and then explore whether this could lead to negative predictions - it may be that the bird abundances are sufficiently high that the issue of negative predictions is not realised over sensible ranges of the predictors. If it turns out that there is an issue, then we would explore alternatives.

In this case, there is an issue - the model does indeed predict fewer than zero birds for some ranges of some of the predictors. Hence, we will skip straight to the alternatives:

  • we could log-transform the response. \[ log(y_i) = \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 X_1 + ... \]
    • the expected values will be the mean of logs
    • due to the transformation, both \(\mu\) and \(\sigma^2\) are modelled on a log scale. \[ y_i \sim{} \mathcal{N}(e^{X\beta}, e^{\sigma^2})\\ \]
    • when back-transforming, this implies that the variances will be unequal, in fact they will increase with the mean. \[ log(y_i) = X\beta + \varepsilon \hspace{1cm} \varepsilon \sim{} \mathcal{N}(0, \sigma^2) \]
  • we could model the data using a Gaussian distribution with a log link. \[ y_i = \mathcal{N}(\mu_i, \sigma^2)\\ log(\mu_i) = \beta_0 + \beta_1 X_1 + ... \]
    • the expected value will he the log of means \[ y_i \sim{} \mathcal{N}(e^{X\beta}, \sigma^2)\\ \]
    • the effects (slopes) become multiplicative \[ log(y_i + \varepsilon) = \beta_0 + \beta_1 X_1 + ... \hspace{1cm} \varepsilon \sim{} \mathcal{N}(0, \sigma^2) \]
  • we could model the data using a Gamma distribution with a log link. \[ y_i = \mathcal{Gamma}(\mu_i, \sigma^2)\\ log(\mu_i) = \beta_0 + \beta_1 X_1 + ... \]

We will explore each of these options:

  • in doing so, we need to consider the patterns of variance
  • linearity
  • (multi)collinearity - correlated predictors should not be in the same model together lest they compete.

Scatterplot matrix

Specific scatterplots

Fit the model

Gaussian (log-transformed)

Model formula:

\[ log(y_i) = \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i = \beta_0 + \beta_1 X_1 + ... \]

where \(\boldsymbol{\beta}\) is a vector of effects parameters and \(\bf{X}\) is a model matrix representing the additive effects of the scaled versions of distance (ln), distance to the nearest large patch (ln), patch area (ln), grazing intensity, year of isolation and altitude on the abundance of forest birds.

Model validation

Partial plots

Caterpillar plot

Gaussian (log-transformed)

loyn.glm %>% plot_model(type = "est")

loyn.glm %>% plot_model(type = "est", transform = "exp", show.values = TRUE)

Model investigation / hypothesis testing

Further analyses

From this point on, we will only proceed with the Gaussian (log-link) model.

Summary figures

References

Loyn, R. H. 1987. “Nature Conservation: The Role of Remnants of Native Vegetation.” In, edited by D. A. Saunders, G. W. Arnold, A. A. Burbridge, and A. J. M. Hopkins. Chipping Norton, NSW: Surrey Beatty & Sons.